{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you haven't installed dataprep, run command `pip install dataprep` or execute the following cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run me if you'd like to install\n",
    "!pip install dataprep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from dataprep.eda import plot, plot_correlation, plot_missing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://s3-us-west-2.amazonaws.com/dataprep.dsl/datasets/suicide-rate.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plot the distribution of each column in the dataframe. \n",
    "For numeric column, show the histogram. For categorical column, show bar chart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"year\"] = df[\"year\"].astype(\"category\")\n",
    "plot(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show the plots of the given column. If column is numeric, show keneral density plot, box plot and qqnorm plot.\n",
    "If column is categorical, show bar plot and pie plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(df, \"sex\")\n",
    "plot(df, \"gdp_per_capita\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show the plots of the relationship of given two columns. \n",
    "* For numeric-categorical, show the box plot for each category.\n",
    "* For numeric-numeric, show the heatmap\n",
    "* For categorical-categorical, show the bar chart of col_x for each category of col_y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(df, \"suicides\", \"sex\")\n",
    "plot(df, \"population\", \"suicides\")\n",
    "plot(df, \"country\", \"generation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show correlation matrix plots using each method (pearson, kendall, spearman)\n",
    "If k is specified, in each matrix plot, only show top-k positive cells, set the color of other cells to white. (Do you want to know the top-k negative cells?)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_without_missing = df.dropna('columns')\n",
    "plot_correlation(df_without_missing)\n",
    "plot_correlation(df_without_missing, k=1)\n",
    "plot_correlation(df_without_missing, value_range=(0,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show the 3 cols that corresponds to x in the correlation matrix (pearson, kendall, spearman)\n",
    "if k is specified, sort the result based on corr. show the 3 cols that corresponds the top-k correlation value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_correlation(df_without_missing, \"suicides\")\n",
    "plot_correlation(df_without_missing, \"suicides\", k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# if value_range is specified, show the correlation value in value_range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_correlation(df_without_missing, \"suicides\", value_range=[-1, 0.3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# if no correlation in the range, show blank fig."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_correlation(df_without_missing, \"suicides\", value_range=[-1, -0.8])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_correlation(df_without_missing, x=\"population\", y=\"suicides_no\")\n",
    "plot_correlation(df_without_missing, x=\"population\", y=\"suicides\", k=5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## show the location/position and percentage of missing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_missing(df, num_bins=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## If one want to remove the rows whose x is missing, \n",
    "the impact of the removed rows on other columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_missing(df, 'HDI_for_year')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## If one want to remove the rows whose x is missing, the impact of the removed rows on y columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_missing(df, 'HDI_for_year', 'population')\n",
    "plot_missing(df, 'HDI_for_year', 'sex')\n",
    "plot_missing(df, 'HDI_for_year', \"country\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
